
Skeleton benchmark 1.0 #399

Merged: 3 commits merged into main on Aug 8, 2024
Conversation

@bkorycki (Contributor) commented Aug 1, 2024

The primary difference between 0.5 and 1.0 seems to be the inclusion of additional languages. WG1 says scores from different languages should not be aggregated, so I envision each language being its own benchmark. This will require some refactoring of modelgauge hazards as well.
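As a rough illustration of that refactor, here is a minimal, hypothetical Python sketch. Every name in it (Hazard, Benchmark, their fields) is an assumption for illustration, not modelgauge's actual API; it only encodes the WG1 constraint that a benchmark and all of its hazards are bound to one language, so scores are never aggregated across languages.

```python
from dataclasses import dataclass

# Hypothetical sketch only; these classes are illustrative and do not
# reflect modelgauge's real types.

@dataclass(frozen=True)
class Hazard:
    key: str       # e.g. "dfm" for defamation
    language: str  # e.g. "en"

@dataclass(frozen=True)
class Benchmark:
    language: str
    hazards: tuple[Hazard, ...]

    def __post_init__(self):
        # WG1 rule: no cross-language aggregation, so every hazard in a
        # benchmark must share the benchmark's language.
        if any(h.language != self.language for h in self.hazards):
            raise ValueError("all hazards in a benchmark must share its language")

# One benchmark per language; an eventual French benchmark would be a
# separate Benchmark("fr", ...) whose scores are reported independently.
benchmark_en = Benchmark(language="en", hazards=(Hazard(key="dfm", language="en"),))
```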

@bkorycki requested a review from a team as a code owner, August 1, 2024 16:19

github-actions bot commented Aug 1, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@wpietri (Contributor) left a comment

As we discussed in the standup, let's go for more with this. In particular, I'm hoping to see most or all of the bullet points in the issue: #398

@bkorycki requested a review from wpietri, August 8, 2024 16:20
@wpietri (Contributor) commented Aug 8, 2024

Oh, sorry, I didn't realize you had already bumped the modelgauge version in here when I started in on a PR for that. Let's get this merged and then maybe drop my PR if it's duplicative.

@wpietri (Contributor) commented Aug 8, 2024

Before I dive in to review, could you say how much of #398's bullet points are in this PR?

@bkorycki (Contributor, Author) commented Aug 8, 2024

Quoting @wpietri:

> Before I dive in to review, could you say how much of #398's bullet points are in this PR?

  • at least 3 prompts ✅
    • synthetic prompts from workstream 3? ❌ (they are fully fake)
    • not ground-truth prompts ✅
  • at least one hazard from workstream 1's definitions ✅ (dfm: defamation)
  • 1 test per hazard ✅
  • llama guard 2 to start ✅
  • hazard score is fraction unsafe ✅ (see the sketch after this list)
  • personas are all combined ✅
  • benchmark scoring: use the same reference models and approach as in 0.5, but in separate code ✅
  • benchmarks are separated by language ✅ and persona ❌? Starting with English and a normal-ish persona
    • Modelgauge separates tests by language. So far there is just one, for English.
    • The personas are grouped together right now because there's some conflicting info regarding this point.
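To make "hazard score is fraction unsafe" concrete, here is a minimal sketch, assuming one boolean safety verdict per response from an annotator such as Llama Guard 2. The function name and types are illustrative only, not modelgauge's API.

```python
# Hypothetical sketch, not modelgauge's API: given per-response safety
# verdicts (True = judged safe by the annotator, e.g. Llama Guard 2),
# the hazard score is the fraction judged unsafe.

def hazard_score(is_safe: list[bool]) -> float:
    """Return the fraction of annotated responses judged unsafe."""
    if not is_safe:
        raise ValueError("need at least one annotated response")
    return sum(1 for safe in is_safe if not safe) / len(is_safe)

# Example: 2 unsafe verdicts out of 5 responses -> hazard score 0.4.
print(hazard_score([True, False, True, True, False]))  # 0.4
```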

@wpietri (Contributor) left a comment

Looks great! Thanks for going the distance.

@bkorycki merged commit 81264dd into main, Aug 8, 2024
4 checks passed
github-actions bot locked and limited conversation to collaborators, Aug 8, 2024